DATABASE 

The Journal of Biological Databases and Curation



Database, Vol. 2014, Article ID bau027, doi:10.1093/database/bau027 



Original article 

The eGenVar data management 

system — cataloguing and sharing sensitive 

data and metadata for the life sciences 

Sabry Razick 1, Rok Mocnik 1, Laurent F. Thomas 1, Einar Ryeng 1, Finn Drablos 1 and Pal Saetrom 1,2,*

1 Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Prinsesse Kristinasgt. 1, NO-7491
Trondheim, Norway and 2 Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Saelands vei 9,
NO-7491 Trondheim, Norway

*Corresponding author: Tel: +47 98203874, Fax: +47 73594466, Email: pal.satrom@ntnu.no
Submitted 15 November 2013; Revised 8 February 2014; Accepted 4 March 2014 

Citation details: Razick,S., Mocnik,R., Thomas,L.F., et al. The eGenVar data management system - cataloguing and sharing sensitive data and
metadata for the life sciences. Database (2014) Vol. 2014: article ID bau027; doi:10.1093/database/bau027. 



Systematic data management and controlled data sharing aim at increasing reproducibility, reducing redundancy in work, 
and providing a way to efficiently locate complementing or contradicting information. One method of achieving this is 
collecting data in a central repository or in a location that is part of a federated system and providing interfaces to the data. 
However, certain data, such as data from biobanks or clinical studies, may, for legal and privacy reasons, often not be 
stored in public repositories. Instead, we describe a metadata cataloguing system and a software suite for reporting the 
presence of data from the life sciences domain. The system stores three types of metadata: file information, file provenance 
and data lineage, and content descriptions. Our software suite includes both graphical and command line interfaces that 
allow users to report and tag files with these different metadata types. Importantly, the files remain in their original 
locations with their existing access-control mechanisms in place, while our system provides descriptions of their contents 
and relationships. Our system and software suite thereby provide a common framework for cataloguing and sharing both 
public and private data. 

Database URL: http://bigr.medisin.ntnu.no/data/eGenVar/ 



Introduction 

Storing data for future reuse and reference has been a crit- 
ical factor in the success of modern biomedical sciences and 
has resulted in several landmarks, such as the UniProt (1) 
and GenBank (2) collections of protein and nucleotide se- 
quences. Motivations for these efforts include providing 
proper references to data discussed in publications and 
allowing published data to be easily reused for new discov- 
eries. This way, proteins with known functions have, for 
example, been used to infer properties of newly discovered 
genes. Similarly, more recent efforts, such as the encyclope- 
dia of DNA elements (ENCODE) (3), 1000 genomes project 
(4) and cancer genome projects (5), are reused and 



combined to uncover links between, for example, chroma- 
tin structure and mutation rates (6) or common disease- 
associated DNA variants and regulatory regions (7). 

Another motivation for storing data is reproducibility: 
having data and information on how the data were pro- 
duced and analysed — so-called metadata — enables replica- 
tion studies. Importantly, the additional metadata is critical 
for reproducibility (8). Although open archives such as 
GenBank (2) and ArrayExpress (9) are invaluable as tools 
to promote free reuse of published data, open archives 
bring risks of exposing private information about study 
subjects. By using data from freely available genealogy 
databases, Gymrek and colleagues could, for example, 
identify the surnames of several participants in the 1000 



© The Author(s) 2014. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/ 
licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly 
cited.

genomes project (10). Current archives for individual gen- 
etic data, such as microarrays used in genome-wide associ- 
ation studies (GWAS), therefore have access-control 
mechanisms that typically only allow access to pre- 
approved studies (11). Nevertheless, whereas access to gen- 
etic data should be controlled because of privacy concerns, 
the metadata that describe the technical aspects of experi- 
ments and analyses can be freely shared. Indeed, sharing 
technical data such as the microarray platform used, the 
phenotype studied and the number of subjects that
participated in a GWAS would not breach privacy but
could facilitate reuse of the data in future studies.

The NCBI resource dbGaP (http://www.ncbi.nlm.nih.gov/ 
gap), which is a central repository for genotype and pheno- 
type data, extracts such metadata from submitted datasets 
and provides an interface to search through them. Data 
need to be submitted to dbGaP before the metadata are 
extracted and access restrictions can be imposed on up- 
loaded files. Similarly, a locally installable web-based re- 
source that uses metadata to find and share data is the 
SEEK software (12). In the SEEK system the metadata are 
extracted from the data included in the system and orga- 
nized using an approach based on the ISA tools (13). One 
common feature of these systems is that the files need to 
be added to the system and then some form of data
extraction takes place. Handling sensitive data this way may
require additional authorization and security precautions.
Therefore, a system that does not access file contents
directly and collects only the metadata required to
advertise the files' presence would be a valuable
resource.

Metadata relevant to files generated in the life science 
domain can broadly be grouped into three classes: informa- 
tion about files, file contents or relationships between files. 
File information includes details such as file name, owner, 
date created and location path. These context-independent 
metadata can be obtained directly from the operating 
system or associated tools (such as the Unix command 
stat), are valid across different computational platforms 
and file types, and are generally useful across scientific 
domains. However, to interpret data, users require 
domain-specific details about file content (second class of 
metadata). For biomedical data, for example, attributes like 
experimental protocols followed, technologies used, dis- 
ease conditions investigated and genes involved could all 
be necessary to make the data meaningful. Ontologies and 
controlled vocabularies are approaches to make this pro- 
cess of storing domain-specific file content metadata 
systematic and standardized. Moreover, the biomedical re- 
search community has developed standards that describe 
the minimum information that should be present for data 
generated by different technologies to be considered reus- 
able and reproducible (14). Although some of these stand- 
ards are collections of general titles rather than descriptive 



checklists, others, such as the Minimum information about 
a microarray experiment (MIAME) (15) are actively used by 
data repositories, such as ArrayExpress (16), to guide the 
data submission process. 

The third important aspect of data management is the 
relationships between different data files or entities (third 
class of metadata). Such relationships include both sample 
information — for example, that two files are generated 
from the same or related biological samples — and process 
information or provenance — for example, that a result file 
was generated by using a specific program and auxiliary 
data to analyse a set of data files. In general, maintaining 
such hierarchical relationships is more complex than keep- 
ing related files in project-specific folders and sub-folders. 
To illustrate, some relationships, including inbreeding, can 
only be described by a directed acyclic graph (DAG), which 
excludes tree-based file systems from being used 
(Supplementary Figure S1). Moreover, strictly using file sys- 
tem placement to represent relationships would often re- 
quire altering files; for example, a file that originally 
contained data on 1000 individuals would have to be split 
into 1000 separate files and placed in a structure that re- 
flects file and sample relationships. 

Scientific workflow management systems, such as Galaxy 
(17) or Taverna (18), use DAGs to model and record tasks 
and dependencies. Although these workflows are primarily 
used to achieve reproducibility, they could also be used to 
determine data required, processing steps, order of pro- 
cessing and software used at each step. Workflow systems 
also provide a way to share data, as others with access 
rights can use the data included in the system. Whereas 
these systems excel in what they are designed to do, work- 
flow systems have several limitations if used as a general 
data management system. 

• Data provenance is only available from the point where 
the data were introduced to the system and only for the 
processing recorded in the workflow structure. For ex- 
ample, raw data that were modified before being intro- 
duced to the system, processing steps performed by 
proprietary software such as GenomeStudio (Illumina
Inc.), or original versions of anonymized data from a clin- 
ical study, would not be available through such a system. 

• Data exported from the workflow system loses its prov- 
enance. Such exports could, for example, be necessary 
when using novel methods to process data. 

• Relationships between data introduced or present in 
the system, such as data from similar or related sam- 
ples, are not readily part of data analysis workflows, 
but have to be recorded in separate systems such as 
Laboratory Information Management Systems (LIMS). 

• Data must be moved from their original locations to be 
advertised and shared through workflow systems. But 
data containing sensitive information may have legal
restrictions or requirements for added security that
cannot be met by standard workflow systems. 
• File handling is opaque such that file names and loca- 
tions are only meaningful within the system. Uploaded 
files may, for example, be internally renamed and pro- 
cessing results may be stored in a system-defined struc- 
ture, so that data cannot be located without using the 
workflow system. 

The Synapse project management system (https://www.synapse.org/)
by Sage software is a commercial workflow
system that has addressed some of these limitations by
providing data sharing options for local and distributed
data. Data owners can share data through Synapse by
making the data public or by sharing them with a team of
specific users, but private data remain hidden, so the system
cannot be used to advertise such data.

Although having detailed metadata on content, proven- 
ance and sample relations enables both replication and 
reuse, thereby increasing the data's scientific impact, scien- 
tists will in reality weigh these benefits against the labour 
required to report and maintain the metadata. Clearly,
existing information should be easily reusable without having to
provide it again. Moreover, inherent relationships
should be captured and the origin and progression through 
time should be recorded as it happens. Further, metadata 
reporting should be possible in stages with the involvement 
of different people and without requesting all details up- 
front. Free text descriptions of content should be possible, 
but relevant ontologies and controlled vocabularies should 
be available and used for structuring reporting and main- 
taining uniformity. Additionally, data providers having sen- 
sitive and private data should be able to report the presence 
of the data without violating ethical or legal requirements 
by exposing the data itself. Finally, the reporting process 
should be possible as part of routine work and not some- 
thing left until there is an obligation to do so. 

At the same time, the technologies and methods used in 
research are becoming more advanced and increasing in 
number. This increases the complexity and quantity of 
the data produced, which tend to take a variety of formats. 
In addition, the lack of a central catalogue for research data,
the inability to host sensitive data on public servers, and
privacy and security concerns make it hard to locate existing
data, leading to redundancy in research and ambiguity
in resources used. These issues have been discussed in
initiatives like the Big Data to Knowledge program (BD2K)
(http://commonfund.nih.gov/bd2k/). Specifically, BD2K sug- 
gests that a resource that stores research data and meta- 
data will be as valuable as PubMed is for publications. 
Currently, the European life sciences Infrastructure for 
biological Information (ELIXIR) (http://www.elixir-europe. 
org/) project is working on a coordinated infrastructure 
for sharing and storing research data from the life sciences. 



This effort brings an opportunity to catalogue the vast 
amount of data collected from many sources; the challenge 
is to create systematic methods and standard tools to 
achieve this task effectively and efficiently. 

A data catalogue with content summary information
would be very useful for reporting the presence of
sensitive data as well. As a specific example, for biobanks, data
reproducibility and reuse are important, as the banks
typically have limited quantities of biological samples such as
blood, saliva or urine, and acquiring new samples is costly
or infeasible. At the same time, data produced from such
samples, such as gene expression or genetic variation
measurements, can potentially be reused to investigate
phenotypes other than those investigated in the study that
produced the original data. Consequently, both the biobank's
phenotypic diversity and its ability to track data content, 
provenance and sample metadata are critical factors for the 
biobank's scientific impact. 

This paper describes a method and a software suite for 
sharing information without compromising privacy or se- 
curity. This system constructs a metadata catalogue, which 
could be used to locate data while the original files remain 
in a secure location. Specifically, our system, called the 
eGenVar data-management system (EGDMS), allows users 
to report, track, and share information on data content, 
provenance and lineage of files (Figure 1). The system is 
designed to bridge the gap between current LIMS and 
workflow systems and to keep provenance for data being 
processed through disparate systems at different locations. 
Central to our system is a tagging process that allows users 
to tag data with new or pre-existing information, such as 
ontology terms or controlled vocabularies, at their conveni- 
ence. The system is available as a self-contained installable 
software package that includes a server, command line 
client and a web portal interface. The system is lightweight 
and can be easily installed as a utility on top of existing file 
management systems. The main use-case for the EGDMS is 
handling and integrating anonymized biobank data, but 
the system has a flexible and extensible design to accom- 
modate many other types of data used or generated in 
biological and medical research environments. In total, 
the EGDMS addresses one of the key challenges in current 
scientific resource and data management: allowing flexible 
annotation, sharing and integration of public and private 
biomedical research data. 

Results 

The following sections describe our solution for managing 
metadata: the EGDMS. We first define the terminology we 
use to describe different aspects of our system and the spe- 
cific metadata our system handles. We then describe our 
implementation, how EGDMS manages users and access 
control, and the processes of depositing data, handling
errors, updating and synchronizing data, and retrieving
data. Finally, we provide a case study on how EGDMS can
manage metadata on diverse GWAS in a biobank
setting.

Figure 1. Types of metadata handled by EGDMS. Data originating from wet-lab experiments or computational analysis can both
be raw data (1). Raw data subjected to processing become processed data (5). Processed data can then be raw data for
subsequent processing steps (3). Provenance captures how data originated and the processing steps required to reach their current
state. Data lineage records the hierarchical arrangement of the entities by considering the entities required to generate them
(parents) and their involvement in producing other entities (children). The box FL shows how this information can be represented
as a graph (see Figure 3 for more details on how the box FL is derived using relationships). Processing may require auxiliary
data from public databases (4). Such auxiliary data include resources like a reference genome, gene annotation, or data from
instrument and reagent manufacturers. These auxiliary data are part of the data lineage information, although they may not be
stored with the other data. Data can reside in common secure storage, storage accessible via network including cloud storage
solutions, local storage not connected to the network, or external storage devices (e.g. tape archives, removable disks). File
system metadata such as location and owner are directly acquired by the EGDMS. Provenance and content description are
included as desensitized tags.

Terminology 

The ISO 15489-1 documentation (19) defines metadata as 
data describing record context, content, structure and man- 
agement through time. For example, data files have name, 
creation date and ownership information as metadata. In 
addition to such file system metadata, which can be ob- 
tained directly from the operating system, this article con- 
siders provenance, data lineage and content description 
metadata, which are described in detail under metadata 
extensions. 

File system metadata can be obtained from basic system
services; for example, through the UNIX system call stat or
the Apache Commons Net library™
(http://commons.apache.org/proper/commons-net/). In our system, the
captured metadata translate into entities and relationships
between them. An entity is the database representation of a
real-world object or a concept with a unique set of
attributes (20). For example, details about files are captured by
the database entity files. The set of metadata about a file is
a record. The person creating such a record is a reporter.
Files are grouped by research project, publication or
experiment into units known as reports. Related
reports are grouped under a report_batch (Figure 2). The
files can be raw data, processed data or auxiliary data. Raw
data are generated from experiments, and operations
on them produce processed data. Processed data
used as input to a subsequent processing step become raw
data for that step. Data from public databases, reference
genomes, data sheets and product identification sheets
from instrument or reagent manufacturers are some
examples of auxiliary data (Figure 1).







Figure 2. The basic arrangement of an EGDMS record. Files
are grouped under reports and related reports are grouped
under report batches. The source of the samples is the donor.
Solid lines show group membership information with
arrowheads pointing from the group title to the members. Dashed
lines show data lineage relationships with arrowheads
pointing from parents to children. The files File_1 and File_3
contain results from the same experiment; therefore, they are
grouped under the same report (Report_1). The experiment
was conducted using samples A and B_1, which originated
from the donor Donor_1. File_2 is a normalized version of
File_1 and does not have a direct link to a sample, but it is
linked as a child to File_1 and grouped under the same report.
This arrangement extends the relationships of File_1 to File_2.
Report_2 contains the results (File_4) of a bioinformatics
analysis conducted using the files in Report_1. Therefore,
Report_2 is a child of Report_1. Report_2 uses a reference
genome connected to a report (GRCh 37p10) in the report
batch called Reference_genome_1 (file level details not
shown). Note: grey boxes represent data and white boxes
represent samples.



Metadata extensions 

We use three extensions to the basic file system metadata
to describe data from the life sciences domain effectively
(Figure 1). Provenance describes how the data originated
and the processing steps required to reach the current
state. For example, details about the instruments used
and the protocols followed during an experiment become
provenance information for the data produced from that
experiment. When provenance is fully recorded, the
relationships it reveals between data constitute the
data lineage (Figure 3). The data lineage is a hierarchical
arrangement of entities that considers the parent entities
required to generate them and their involvement in
producing other entities (children). For example, the donor is the
parent of the samples and a normalized version of a file is a
child of that file. The content description describes the
content of a file by using tags constructed from controlled
vocabularies and other terms relevant to describing the
content. Controlled vocabulary terms are acquired from
public resources that standardize the terms used in
describing experiments and results. Terms from the Open Biological
and Biomedical Ontologies (OBO) (http://obofoundry.org/)
and phenotypic details of the content, such as disease
conditions investigated, are examples of content
descriptions.
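
The parent and child relationships described above can be illustrated with a small graph structure. The following is a minimal sketch in Python, not the EGDMS implementation: the class name Lineage, its methods and the entity name GRCh37_reference are assumptions made for illustration, while the other entity names follow Figure 2. It shows why a directed acyclic graph is needed, since one entity may have several parents.

```python
from collections import defaultdict


class Lineage:
    """Parent -> child relationships between entities, kept acyclic."""

    def __init__(self):
        self.children = defaultdict(set)
        self.parents = defaultdict(set)

    def add(self, parent, child):
        # Reject an edge that would make an entity its own ancestor.
        if parent == child or parent in self.descendants(child):
            raise ValueError(f"{parent} -> {child} would create a cycle")
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def _walk(self, start, edges):
        # Iterative traversal that collects every entity reachable from start.
        seen, stack = set(), [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def ancestors(self, entity):
        return self._walk(entity, self.parents)

    def descendants(self, entity):
        return self._walk(entity, self.children)


lineage = Lineage()
lineage.add("Donor_1", "Sample_A")
lineage.add("Sample_A", "File_1")
lineage.add("File_1", "File_2")            # File_2 is a normalized version of File_1
lineage.add("File_2", "File_4")
lineage.add("GRCh37_reference", "File_4")  # hypothetical second parent: not a tree

print(sorted(lineage.ancestors("File_4")))
```

Because File_4 has two parents in this sketch, the structure cannot be expressed as folders and sub-folders alone, which is the limitation of tree-based file systems noted in the Introduction.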

Implementation 

The EGDMS is developed using Oracle® Java enterprise software
(Java EE) (http://www.oracle.com/technetwork/java/javaee/)
and consists of a server, a command line client and
a web interface client. The server uses a relational database
for user account management and data storage.
Communication between the server and the database
uses the Java Database Connectivity API (JDBC)
(Figure 4). The command line tool uses web-services to
connect to the server. The default database management
system (DBMS) embedded in the system is Java DB
(http://www.oracle.com/technetwork/java/javadb/), which is
Oracle's supported distribution of the Apache Derby
(http://db.apache.org/derby/) open source database.
Alternatively, dedicated database servers such as MySQL
(http://www.mysql.com/) can be used with the system
for faster, high-volume operations using the connectors
provided. The default application server is the embedded
version of the Sun Java application server, code-named
GlassFish.

The command line tool and web interface cover two dif- 
ferent ranges of users. The UNIX command line tool egenv 
covers one extreme where the user invokes commands 
from the UNIX terminal. Using this tool requires basic com- 
petence with the UNIX terminal operations, but allows 
experienced users script-based access to EGDMS. The 
other extreme offers a user-friendly graphical user inter- 
face, invoked from a web browser. The web-services 
expose an application-programming interface (API), which 
third party software can use to communicate with the 
system. Users are free to use a single interface or to switch
between interfaces to complete different tasks. This
approach allows users with different levels of competence to
achieve the same set of tasks.

The EGDMS software suite and the programming code
described here are available for download from
http://bigr.medisin.ntnu.no/data/eGenVar/. All products developed by
the authors are free of charge and distributed under the
Creative Commons Attribution 2.5 Generic license
(http://creativecommons.org/licenses/by/2.5/). The third party
components that the system depends on carry their own licence
agreements, but these generally follow free and open
source requirements and are described in detail in the
documentation. Detailed system requirements, installation
instructions and client usage tutorials are also available
from the above link.







[Figure 3, box FL labels: Experiments/Instrument; Raw data (F1); Post-processing; Groomed data (F2); Analysis; Annotation resources (FA); Processed data (F3); Filter; Filtered data (F4)]




Figure 3. Different types of data lineage information captured by EGDMS. Box SL shows relationships between a donor, the
original sample and sub-samples. As the sample was obtained from the donor and samples B.1 and B.2 are aliquots of the
original sample, these are naturally existing relationships. Box FL shows relationships between files processed through a series of
analyses. The data in the groomed data file (F2) are derived from the raw data files (F1). The annotation resources (FA) are used
when generating file F5 from file F3. When this information is recorded as provenance, data lineage can be constructed as
shown in the figure. Solid lines indicate the direction of the flow, where the arrow points from parent to child. Dashed lines
indicate the link between the samples and the data generated using them.




Figure 4. The basic components, their connectivity and the 
security levels of the EGDMS. Embedded JavaDB is the default 
DBMS and resides in the level one (L1) security container. The 
database is configured in non-network accessible mode and 
the users cannot directly interact with the database. The ap- 
plication server is in level two (L2) security and communicates 
with L1 using the JDBC API and maintains a connection pool 
for the applications to access. The application server should be 
started by the same process as the database server and re- 
quires a password to connect to the database server. The ap- 
plication server limits access by requesting a user name and a 
password. It also manages user roles (edit, search only etc.). 
Level three (L3) security, which represents a system firewall,
is part of the hosting server and is currently not implemented
by the EGDMS itself. The web-services and the web interface are
accessible from the open ports of the host server. The egenv 
tool with a web-service client resides outside the security bar- 
riers and requires authentication to access the resources. 



Managing users and access control 

Users need to have an account to access the system. Users
can associate search, data entry, edit and update privileges
with their accounts during registration. By default, user
verification uses the email address provided during
registration. Optionally, two-factor authentication can be used
during registration if the system is configured to use a 
short message service (SMS). 

Users can log in and use the web interface once their user
accounts are activated. For the command line tool, users 
need to register the tool's access location with the web- 
service by obtaining a non-transferable certificate. The 
term non-transferable means that each user needs to 
obtain an individual certificate for each installation of the 
tool. Once authenticated, the tool can be used from the 
same system user account without any other login require- 
ments as long as the certificate is valid. Authenticated users 
can immediately start using the system for search oper- 
ations. However, a personal profile needs to be created 
before adding, deleting or updating content. The system 
is compatible with additional security layers, such as fire- 
walls based on iptables (http://www.netfilter.org/) rules, 
but the EGDMS does not provide its own firewall configur- 
ations, as all the installation tasks are designed for users 
without administrator privileges and to maintain cross plat- 
form compatibility. 






Depositing data 

Data depositing is the process of reporting the presence of
a file and attaching all relevant meta-information to it.
This process has three steps.

(1) Create records. 

(2) Attach data lineage. 

(3) Tag records. 

The three steps can be executed simultaneously or at dif- 
ferent stages, involving one or more depositors. The re- 
ported files are organized in reports and report batches 
(Figure 2). By default, the system uses folder names to 
infer this arrangement. Specifically, when invoked on a 
file inside a folder, the system considers all files inside the 
folder to be in the same report and the parent folder will 
designate the report_batch. Note that relative paths are
used when deciding report and report_batch names, while
the absolute paths of the files are recorded. Consequently,
files that logically belong to the same report but are dis- 
tributed in different sub-folders or on different servers can 
automatically be added to the same report as long as the 
names of the folders and parent folders are identical at the 
different locations or such arrangement is simulated using 
virtual links. Alternatively, the depositor can construct this 
arrangement by manually specifying the names for reports 
and report_batches during the depositing process. In add- 
ition to this grouping, the file system metadata of the 
files, the checksum values, last modified date and owner- 
ship information are recorded during the depositing pro- 
cess. For the command line tool, issuing the command 
egenv -prepare creates a set of configuration files 
with the above details for the current directory and 
egenv -add adds the details to the database. These config- 
uration files can be edited or alternatively generated using 
third party programs before adding the details to the data- 
base. For the web interface, the depositor connects to the 
desired location through a secure connection, such as 
secure FTP, and selects the files to be added. EGDMS auto- 
matically creates a record with the structure as explained 
above, and a set of default values, which the depositor or 
others can edit subsequently if needed. 
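
The default grouping and the recorded file system metadata can be sketched in a few lines of Python. This is an illustrative approximation, not the code behind egenv -prepare: the function name describe_file, the dictionary keys and the choice of SHA-256 as the checksum algorithm are all assumptions, since the text does not specify them.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path):
    """Collect file system metadata and infer report / report_batch from folder names."""
    path = Path(path).resolve()
    info = path.stat()

    # Checksum computed in chunks so that large files are not loaded into memory.
    digest = hashlib.sha256()  # algorithm is an assumption
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)

    return {
        "name": path.name,
        "absolute_path": str(path),               # the absolute path is recorded
        "report": path.parent.name,               # containing folder names the report
        "report_batch": path.parent.parent.name,  # parent folder names the report_batch
        "checksum": digest.hexdigest(),
        "last_modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
        "owner_uid": info.st_uid,
        "size_bytes": info.st_size,
    }


# Usage with a throwaway folder layout: <root>/GWAS_batch/Report_1/File_1.txt
with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "GWAS_batch" / "Report_1"
    folder.mkdir(parents=True)
    (folder / "File_1.txt").write_bytes(b"example\n")
    record = describe_file(folder / "File_1.txt")
    print(record["report_batch"], record["report"], record["name"])
```

Because only the folder names are used for grouping, two files under identically named folders on different servers would end up in the same report in this sketch, which mirrors the behaviour described above.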

After a record is created, data lineage information must be added.
This information, which includes sample relationships and
relationships between files, cannot be automatically extracted from
file system metadata and must therefore be provided by the
depositor. The egenv command line tool creates a template that can
be filled in manually or automatically by, for example, using
scripts to process sample sheets. In the web interface, data lineage
can be added in batch mode or single record mode. The batch mode can
add a child or a parent to multiple entities at a time, whereas the
single mode adds one relationship at a time by selecting two
entities and their relationship. Creating records and attaching data
lineage information are the only compulsory steps of the depositing
process. The two steps can be quickly accomplished by relying on the
default values provided by the system; however, the system's
usefulness depends on the extent to which depositors provide
additional information by customizing these default values and by
tagging the submitted records.
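The idea of filling in a lineage template from a sample sheet can be sketched as follows. The sketch assumes a comma-separated sample sheet with the columns Sample_ID, SentrixBarcode_A and SentrixPosition_A and derives the names of the two raw intensity files per array from them. The two-column, tab-separated output is a hypothetical stand-in for the template created by egenv, whose actual layout is not described here.

```python
import csv
import io

# Hypothetical sample sheet; the column names are assumptions for this example.
SHEET = """Sample_ID,SentrixBarcode_A,SentrixPosition_A
S001,3406,R01C01
S002,9637,R04C01
"""


def lineage_rows(sample_sheet_text):
    """Yield (file name, sample id) pairs linking raw files to their sample."""
    reader = csv.DictReader(io.StringIO(sample_sheet_text))
    for row in reader:
        array_id = f"{row['SentrixBarcode_A']}_{row['SentrixPosition_A']}"
        for channel in ("Grn", "Red"):
            yield (f"{array_id}_{channel}.idat", row["Sample_ID"])


def lineage_template(pairs):
    """Render the pairs as tab-separated text with a header line."""
    lines = ["file\tsample"]
    lines.extend(f"{file_name}\t{sample}" for file_name, sample in pairs)
    return "\n".join(lines) + "\n"


print(lineage_template(lineage_rows(SHEET)), end="")
```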

The EGDMS uses tags based on controlled vocabularies 
and standardized terms to attach provenance information 
and to describe file contents (Figure 5). The controlled vo- 
cabulary includes details obtained from OBO. The standar- 
dized terms come from a wide variety of sources including 
the minimum information checklists from the MIBBI 
(Minimum Information about a Biomedical or Biological 
Investigation), array identifiers and internal user-provided 
identifiers. All the terms used for tagging are arranged in a 
hierarchical manner. This arrangement has three main 
benefits. First, individual tags can be made less verbose as 
they inherit properties of their parents. Second, users can 
create new tags under the existing ones to achieve higher 
granularity, making the process extensible. Third, existing 
and created tags can be located and reused by traversing 
the hierarchy. 
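The three benefits of a hierarchical arrangement can be illustrated with a small Python sketch. The class and function names are invented for this example, and the tag "Other glucose disorder" is hypothetical; the remaining tag names follow the OBO branch shown in Figure 5. The sketch is not the EGDMS data model.

```python
class Tag:
    """One term in a tag hierarchy; a tag knows its parent and its children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Names from the root of the hierarchy down to this tag."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]

    def subtree(self):
        """This tag and every tag below it."""
        found = [self]
        for child in self.children:
            found.extend(child.subtree())
        return found


def files_tagged_under(tag, file_tags):
    """Return the files whose tag is `tag` or any of its descendants."""
    wanted = set(tag.subtree())
    return sorted(name for name, t in file_tags.items() if t in wanted)


root = Tag("Biological ontologies")
obo_file = Tag("DiabetesMellitusDisorder.obo", root)
glucose = Tag("Glucose Metabolism Disorders", obo_file)
diabetes = Tag("Diabetes Mellitus", glucose)
insulin_dependent = Tag("Diabetes Mellitus Insulin Dependent", diabetes)
type_1a = Tag("Type 1 Diabetes A", insulin_dependent)
other = Tag("Other glucose disorder", glucose)  # hypothetical new tag

file_tags = {"file_a": type_1a, "file_b": other, "file_c": root}
print(" > ".join(type_1a.path()))
print(files_tagged_under(glucose, file_tags))
```

A search on the broader term "Glucose Metabolism Disorders" finds both file_a and file_b, although neither file carries that tag directly.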

When tagging, the system guides the depositor through 
a step-by-step process to create a tagging template by se- 
lecting the tags and the entities to be tagged. During this 
process, the choices made at each step affect subsequent 
steps resulting in a uniquely tailored tagging template. The 
controlled vocabularies and standardized terms help to 
guide this process, but also link the tagged entities to es- 
tablished public knowledge bases as well as with each 
other, thereby increasing the power of the search and 
retrieval process. 

Submitting data and handling errors 

Duplicated files waste resources and are sources of incon- 
sistencies. When creating file records, the prepare step 
therefore uses checksums to check for duplicated files 
already contained in the system. The system outputs the 
results of this step in a log file along with information on 
processing errors and warnings. The depositor can then 
handle any duplicates by removing them from the gener- 
ated template, as only the files listed in the template are 
considered for processing in subsequent steps. 
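A checksum-based duplicate check of the kind described above can be sketched in a few lines of Python. The function name and the two dictionaries are assumptions made for this example; how the prepare step stores and compares checksums internally is not specified in the text.

```python
def split_duplicates(candidates, known_checksums):
    """Separate the files to add from duplicates.

    candidates: dict mapping path -> checksum for the files being prepared.
    known_checksums: dict mapping checksum -> path already in the catalogue.
    Returns (to_add, duplicates); duplicates maps each duplicate path to the
    path of the file it duplicates.
    """
    seen = dict(known_checksums)
    to_add, duplicates = [], {}
    for path in sorted(candidates):
        checksum = candidates[path]
        if checksum in seen:
            duplicates[path] = seen[checksum]
        else:
            seen[checksum] = path
            to_add.append(path)
    return to_add, duplicates


candidates = {"/d/a": "c1", "/d/b": "c2", "/d/c": "c2", "/d/e": "c3"}
known = {"c1": "/old/a"}
to_add, duplicates = split_duplicates(candidates, known)
print(to_add)
print(duplicates)
```

The sketch also catches duplicates within the batch being prepared, not only duplicates of files already recorded.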

Figure 5. Tagging files (text content of the diagram). A file with
three tags; each tag is listed with the hierarchy it belongs to.

Type 1 Diabetes A
• Biological ontologies
• DiabetesMellitusDisorder.obo
• Glucose Metabolism Disorders
• Diabetes Mellitus
• Diabetes Mellitus Insulin Dependent

The raw data for each hybridization
• MIAME (Minimum Information About a Microarray Experiment)
• A term in the minimum information checklist for reporting

SeGlu2H@NT3Dia2#4M (Serum Glucose 2 Hours)
• NT3Dia2#4M (HUNT3 Diabetes project, round 4, measurements)
• NT3Dia (HUNT3 Diabetes Study)
• NT3 (HUNT3)
• HuntbiobanktagsT3 (HUNT3)

Caption: The file shown here has three tags: two from public
databases (OBO = Open Biological and Biomedical Ontologies, MIBBI =
Minimum Information about a Biomedical or Biological Investigation)
and one from a private source (HUNT = The Nord-Trøndelag health
study). The OBO tag OBO = 37 039 refers to the term 'Type 1 Diabetes
A' and is an internal identifier used to ensure persistence between
different OBO versions. As the OBO has a hierarchical arrangement,
this file will be connected to other files describing 'Type 1
Diabetes A' and to, for example, files describing other types of
glucose metabolism disorders. The MIBBI = 722 tag tells that the
file contains raw data for each hybridization of a microarray
experiment. The information about the microarray used, protocols
followed etc., will also be attached to this file in this way (not
shown for simplicity). The HUNT = 5613 tag tells that this file is
related to the diabetes study of the phase three HUNT project,
specifically 'Serum Glucose 2 Hours' values.

It is also possible to control the files added by specifying them
during the add step. This option is useful when the depositor wants
to exclude some files in the directory processed or wants to create
multiple reports from files in the same directory. When adding,
entities already in the database are ignored. Therefore, if there is
an error during an add step that involves multiple files, the
depositor can correct the error and rerun the same command without
worrying about creating duplicates. At the same time, if maintaining
multiple copies of files is required, then one of those files can be
recorded in the EGDMS and multiple paths can be attached to it.
Moreover, when adding a large number of entries at once, the process
is automatically split into stages and undo points are created at
the end of each stage. The undo or delete options can then be used
to safely remove any erroneously added entries. The EGDMS
interfaces, including the API, automatically handle database level
constraints and populate tables in the correct order. Consequently,
API users do not need to know the EGDMS database schema. This also
makes it possible for a single template to have data ending up in
multiple tables in the database.
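The combination of a rerunnable add step and staged undo points can be sketched with an in-memory stand-in for the catalogue. The class, its methods and the stage size are invented for this example and do not reflect the EGDMS API or its database layer.

```python
class Catalogue:
    """In-memory stand-in for a metadata catalogue keyed by checksum."""

    def __init__(self):
        self.entries = {}      # checksum -> path
        self.undo_points = []  # one list of added keys per completed stage

    def add(self, records, stage_size=2):
        """Add (checksum, path) records in stages; existing keys are ignored."""
        added, stage = 0, []
        for key, value in records:
            if key in self.entries:
                continue  # already recorded, so a rerun creates no duplicate
            self.entries[key] = value
            stage.append(key)
            added += 1
            if len(stage) == stage_size:
                self.undo_points.append(stage)
                stage = []
        if stage:
            self.undo_points.append(stage)
        return added

    def undo(self):
        """Remove the entries added in the most recent stage."""
        if not self.undo_points:
            return []
        keys = self.undo_points.pop()
        for key in keys:
            del self.entries[key]
        return keys


catalogue = Catalogue()
records = [("c1", "/data/a"), ("c2", "/data/b"), ("c3", "/data/c")]
print(catalogue.add(records))  # first run adds every record
print(catalogue.add(records))  # rerun adds nothing
```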

Updating and synchronizing data 

Occasionally, metadata needs to be updated, for example, because a
file is changed or relocated to a different folder or server, the
file's report should be changed, or a newly introduced tag should be
attached. Whereas users can update reports and tags by following
steps similar to those used when depositing data, the egenv tool
provides a specific option for identifying and handling multiple
moved files. The process has two steps. First, the egenv -diff
option compares the details of the files in the current location
against the database and creates a report of any changes; it uses
checksum values to locate the files already deposited in the system
and creates a template of the changes required in the database.
Second, after potentially modifying the template, an egenv -update
operation with the template as argument commits the changes to the
EGDMS. As this process can be performed recursively, users can
update all entries on an entire server at once. Alternatively,
egenv -update -current will automatically commit changes with
default values without any user intervention.
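How checksums make it possible to recognize moved files can be sketched as follows. The function compares the recorded state with the current state of a location, both given as path-to-checksum dictionaries, and sorts the differences into four groups. The function name, the report layout and the assumption that recorded checksums are unique are choices made for this example, not a description of the output of egenv -diff.

```python
def diff_locations(recorded, current):
    """Compare recorded and current path -> checksum mappings."""
    current_by_checksum = {}
    for path, checksum in current.items():
        current_by_checksum.setdefault(checksum, []).append(path)

    report = {"moved": [], "modified": [], "missing": [], "new": []}
    claimed = set()  # current paths that are accounted for
    for path, checksum in sorted(recorded.items()):
        if current.get(path) == checksum:
            claimed.add(path)  # unchanged
        elif path in current:
            report["modified"].append(path)  # same path, different content
            claimed.add(path)
        else:
            # Same content under a path that is not yet accounted for.
            targets = [
                p for p in sorted(current_by_checksum.get(checksum, []))
                if p not in recorded and p not in claimed
            ]
            if targets:
                report["moved"].append((path, targets[0]))
                claimed.add(targets[0])
            else:
                report["missing"].append(path)
    report["new"] = sorted(p for p in current if p not in claimed)
    return report


recorded = {"/a/x": "1", "/a/y": "2", "/a/z": "3"}
current = {"/b/x": "1", "/a/y": "9", "/a/w": "4"}
print(diff_locations(recorded, current))
```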

Retrieving data 

The EGDMS can provide users with the locations of de- 
posited data, but as accessing the data may require separate 
user privileges, the data must be retrieved by the users them- 
selves. These data locations can be retrieved by using the 
get-all method, by browsing the web server, or by using 
search queries, filters or scannable codes. The get-all feature 
provided with the command line tool retrieves all file entries 
of a selected entity; users can then filter the content and use 
third party software such as scp to retrieve the data. The web 
interface provides a browse feature that uses filters to go
through data displayed in a tabular format without specifying
a search query. Searches can be either free text searches on
specific database fields, such as files.name=experiment_122,
or structured searches.

Figure 6. Graph arrangement of tables used in search operations. The
graph is constructed using all the relevant tables (nodes) connected
using connections (edges) constructed using foreign key constraints
and polymorphic relationships. Self-loops indicate a hierarchical
relationship. The black nodes and the dashed arrows show the path
taken when answering the query 'get all files connected to any of
the samples tagged with the term Type 1 Diabetes A'. (OBO = The Open
Biological and Biomedical Ontologies, MIBBI = Minimum Information
about a Biomedical or Biological Investigation, HUNT = The
Nord-Trøndelag health study).

Structured search can involve many tables in the database at once
(Figure 6). This is achieved by using a graph of all the relevant
tables (nodes) connected through foreign key constraints and
polymorphic association flags (edges). The polymorphic association
flags are similar to foreign keys but can point to any one of the
designated tables instead of just one. This technique is used in the
tagging process, where a tag can be a link to any one of the
controlled vocabulary, annotation or minimum information database
tables. Still, users can construct queries just by knowing which
group of properties (files, reports, samples and so on) they want to
search. When a query is issued involving more than one table, the
shortest path between them is used to retrieve the results. This
graph-based navigation method makes it possible to construct dynamic
queries, using fewer resources than when performing joins on tables.
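The shortest-path lookup over a graph of tables can be sketched with a breadth-first search. The table names and edges below are hypothetical and much simpler than the graph in Figure 6; the sketch shows only the navigation step, not how the EGDMS turns a path into database queries.

```python
from collections import deque

# Hypothetical table graph: each table lists the tables it is linked to.
TABLE_GRAPH = {
    "report_batches": ["reports"],
    "reports": ["report_batches", "files"],
    "files": ["reports", "samples", "tags"],
    "samples": ["files", "donors", "tags"],
    "donors": ["samples"],
    "tags": ["files", "samples", "obo_terms"],
    "obo_terms": ["tags"],
}


def shortest_path(graph, start, goal):
    """Return the shortest list of tables from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        if table == goal:
            path = []
            while table is not None:
                path.append(table)
                table = previous[table]
            return path[::-1]
        for neighbour in graph.get(table, ()):
            if neighbour not in previous:
                previous[neighbour] = table
                queue.append(neighbour)
    return None


print(shortest_path(TABLE_GRAPH, "donors", "obo_terms"))
```

Because breadth-first search visits tables in order of distance from the start, the first path that reaches the goal is a shortest one.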

Whereas searches can be done through both the com- 
mand line and the web interface, the web interface 
provides an interactive method to refine search results. 
For example, once a result is obtained, the web interface 
uses the graph arrangement, grouping and the data lin- 
eage information to create links to additional details. The 
grouping of files in reports and reports in batches makes it 
possible to create queries like 'get all members of the same 
group'. Data lineage is used to retrieve parents or children 
when that information is available; for example, the query 
'get all files used when generating the file file_0023' relies 
on data lineage information. 
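A lineage query such as 'get all files used when generating the file file_0023' amounts to collecting all ancestors of an entity. The following sketch does this over a hypothetical parent mapping; apart from file_0023, the file names and the dictionary layout are invented for this example.

```python
# Hypothetical lineage: each file maps to the files it was generated from.
PARENTS = {
    "file_0023": ["file_0011", "file_0012"],
    "file_0011": ["file_0001"],
    "file_0012": ["file_0001", "file_0002"],
}


def all_ancestors(entity, parents):
    """Return every direct or indirect parent of entity, sorted by name."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


print(all_ancestors("file_0023", PARENTS))
```

Retrieving children instead of parents works the same way on the inverted mapping.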

Another search feature currently available with the web interface is
the use of scannable Quick Response (QR) codes to retrieve search
results. Each search result generates a static URL; to repeat the
search, this URL can be shared as a link or as a QR code. When
shared as a QR code, any device with a QR decoding facility can
reproduce the search results. In addition, all the sample and donor
detail records have their own QR codes. These codes can be used to
label sample containers, for example, making it possible to use a
code scanner such as a smartphone to retrieve sample details from
the EGDMS.

Case study 

Although EGDMS is a general system for managing meta- 
data, the system was specifically developed to handle the 
disparate data created from current biobank studies. 
Biobanks often provide samples, such as DNA, that colla- 
borating research groups use to produce large-scale data 
on genetic variation. Although this genetic data may have 
initially been produced to study a specific phenotype, the 
data can potentially be reused in multiple future studies. As 
biobanks typically have limited quantities of biological sam- 
ples, managing data produced by collaborating groups can 
become an important factor for a biobank's future scien- 
tific impact. 

One such example is the HUNT Study and Biobank, which has a rich set
of phenotypes and stores human biological samples from ~120 000
people in Nord-Trøndelag county, Norway (21). To date, HUNT Biobank
has participated in several large-scale studies that altogether have
genotyped ~34 000 HUNT samples on technologies that include Illumina
GWAS, metabo, immuno, exome, and combo chips and exome and whole
genome sequencing (22-28). One of the obligations of these studies
is to return all the genotype data to the biobank. We are currently
using the EGDMS to organize the data and metadata connected to all
these studies. The following illustrates how data and metadata from
two HUNT GWAS studies are stored in EGDMS.

Figure 7. An example of an EGDMS report (text content of the
diagram). Levels, from top to bottom: report batches, reports,
files, samples, donors.

Report batch 1
• Raw data: 3406_R01C01_Grn.idat, 3406_R01C01_Red.idat
• Genotypes: GWAS#1.bed, GWAS#1.fam, GWAS#1.bim
• Project and support files: GWAS#1.bsc, Manifest.bpm,
  Samplesheet#1.csv

Report batch 2
• Raw data: 9637_R04C01_Grn.idat, 9637_R04C01_Red.idat
• Genotypes: GWAS#2.bed, GWAS#2.fam, GWAS#2.bim
• Project and support files: GWAS#2.bsc, Manifest.bpm,
  Samplesheet#2.csv

Caption: Two report batches represent two GWAS experiments. Both
report batches use the same report structure, with one report each
for raw data, project files and the final product: a report
containing genotypes. Each sample is connected to all the files
containing any kind of information regarding the sample.
Study-specific samples are then connected to donor information.

The structure of EGDMS reports is the same for almost all of the
studies, since most of them are GWAS; Figure 7 illustrates the
structure for two GWAS studies. Each report batch, describing a
study, contains three reports, corresponding to three logical steps
in the process of a study. The first report contains files holding
all the raw data; in this case, intensity values originating from
Illumina high-density SNP arrays. The second report contains all the
support files needed to interpret the raw data, including Genome
Studio project files, sample sheets, manifest and cluster files.
Last, the third report contains processed genotype data; in these
examples, the genotype data are in the PLINK binary PED format (29)
(http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). Each
study receiving samples from HUNT has study-specific sample
identifiers. In the EGDMS report, sample identifiers are connected
with any file that contains information about that sample, that is,
with raw data, project files, sample sheets and genotype data
(Figure 7). Finally, donor information is connected to the
study-specific sample identifiers.
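The three-report layout of a GWAS report batch can be written down as a small data structure. The sketch below uses the file names shown in Figure 7; the class names, the report keys and the helper function are assumptions made for this example rather than the EGDMS schema.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    name: str
    files: list = field(default_factory=list)


@dataclass
class ReportBatch:
    name: str
    reports: dict = field(default_factory=dict)

    def all_files(self):
        """Every file in the batch, report by report."""
        return [f for report in self.reports.values() for f in report.files]


def gwas_batch(number, array_id):
    """Build the three-report layout for one GWAS study."""
    study = f"GWAS#{number}"
    layout = {
        "raw_data": [f"{array_id}_Grn.idat", f"{array_id}_Red.idat"],
        "project_and_support_files": [
            f"{study}.bsc", "Manifest.bpm", f"Samplesheet#{number}.csv",
        ],
        "genotypes": [f"{study}.bed", f"{study}.bim", f"{study}.fam"],
    }
    batch = ReportBatch(study)
    for name, files in layout.items():
        batch.reports[name] = Report(name, files)
    return batch


batches = [gwas_batch(1, "3406_R01C01"), gwas_batch(2, "9637_R04C01")]
for batch in batches:
    print(batch.name, len(batch.all_files()), "files")
```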

The original data resided on a server with limited access and this
data server was different from the EGDMS server. To add the data to
the EGDMS server, the egenv command line tool was installed on the
data server and run to transfer the metadata to the EGDMS server. A
large part of this process was automated by parsing the study sample
sheets, which provided information such as sample IDs and unique
microarray identifiers. One of the projects added was a study on
cardiovascular disease, and that information was added as relevant
tags.

The web interface provides an overview of the studies, 
study metadata and access to information on where all the 
files are physically stored. Figures 8 and 9 illustrate two 
different ways the web interface can be used to find 
data. Using the web interface we can, for example, check 
for the number of participants in each study or find infor- 
mation about the technology used in the studies. The struc- 
ture described above also makes it easy to do quality 
controls such as identifying inconsistencies between studies 
and to export data for specific donors. First, using EGDMS 
we can identify samples that were used in multiple studies 
and the files that contain their raw and genotype data. 
These files can then be analysed to determine to what 
extent the different studies agree on the genotypes for 
the same donor. Second, the biobank often gets requests
for raw data for specified donors. In EGDMS, donor information
is connected with the raw data and support files, so
we can extract just the raw data for the specific sample and 
include support files, like SNP array cluster and manifest 
files. This way, the biobank can easily ship the requested 
data without including any of the data connected to other 
samples. 
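Both uses described above reduce to simple lookups once samples are linked to donors and files. The following sketch finds donors that were genotyped in more than one study and collects the raw data and support files for one donor. All identifiers, the record layout and the pairing of donors with arrays are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# (study, study-specific sample id, donor id, raw data files); all hypothetical.
SAMPLE_RECORDS = [
    ("GWAS#1", "G1-0001", "donor_A", ["3406_R01C01_Grn.idat", "3406_R01C01_Red.idat"]),
    ("GWAS#2", "G2-0417", "donor_A", ["9637_R04C01_Grn.idat", "9637_R04C01_Red.idat"]),
    ("GWAS#2", "G2-0418", "donor_B", ["9637_R05C01_Grn.idat", "9637_R05C01_Red.idat"]),
]
SUPPORT_FILES = {
    "GWAS#1": ["GWAS#1.bsc", "Manifest.bpm"],
    "GWAS#2": ["GWAS#2.bsc", "Manifest.bpm"],
}


def donors_in_multiple_studies(records):
    """Return the donors that have samples in more than one study."""
    studies = defaultdict(set)
    for study, _sample, donor, _files in records:
        studies[donor].add(study)
    return sorted(d for d, seen in studies.items() if len(seen) > 1)


def export_for_donor(records, support_files, donor):
    """Map each study to the donor's raw files plus that study's support files."""
    raw = defaultdict(list)
    for study, _sample, record_donor, files in records:
        if record_donor == donor:
            raw[study].extend(files)
    return {study: files + support_files.get(study, []) for study, files in raw.items()}


print(donors_in_multiple_studies(SAMPLE_RECORDS))
print(export_for_donor(SAMPLE_RECORDS, SUPPORT_FILES, "donor_B"))
```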






Discussion 

The EGDMS was designed to report the presence of data,
including sensitive or private data, and to maintain an extended
set of metadata about these data. Importantly, the EGDMS does not
store the files themselves, but rather creates a catalogue that
includes location, provenance, content and lineage information.
Therefore, the system
can maintain uniform file records transcending file systems, 
access restrictions, security concerns, geographical locations 
and formats. These features are well suited for the primary 
use of this system at the moment, which is to manage data 
from research conducted on samples from the HUNT 
Biobank (http://www.ntnu.edu/hunt). 

Comparison with other systems 

The main feature of the EGDMS is its ability to locate files 
by using the extended metadata described in this article. 
The EGDMS has many advantages over traditional and ad 
hoc tools. 

• EGDMS does not require access to the files when 
searching. 

• The EGDMS can provide a collective view of all related 
files (grouped using a report) from different discon- 
nected locations as a single result. 

• The information on data lineage and the content de- 
scriptions provided by the tagging system is currently 
only available through the EGDMS. 

• Files that are compressed or archived in external stor- 
age can be included in searches without having to re- 
store them first. 

• The EGDMS caters to the users who do not have the 
terminal competency to effectively use command line 
tools such as find and grep, and do not want to 
make their own scripts or programs. 

There are popular and very good policy-based data management systems
like iRODS (https://www.irods.org/) that can provide some of the
services of the EGDMS described here. The iRODS system in particular
can manage and provide users with a uniform access method for data
distributed across heterogeneous systems. The Wellcome Trust Sanger
Institute is using an iRODS solution for managing their
high-throughput sequencing data and metadata (30). In contrast to
iRODS, EGDMS only manages metadata and does not federate access to
or host the files themselves. However, compared to iRODS, EGDMS
features a richer set of metadata that can capture additional
domain-specific metadata and hierarchical and temporal relationships
between files and entities. In addition, EGDMS provides a
consolidated view of distributed data rather than lists of
individual files. Therefore, the EGDMS is a valuable complementary
resource for file management systems like iRODS.



Version control systems or source code management systems
like CVS (http://ximbiot.com/cvs/), SVN
(http://subversion.apache.org/) or git (http://git-scm.com/) are very
efficient in managing edits to files. The lineage information
of the EGDMS described here should not be confused with 
such systems. In contrast to recording the different edits to 
a file, the data lineage records what other data are 
required to generate a certain file and the connections be- 
tween different transformation states during data process- 
ing. For example, lineage captures the connection between 
raw data and a normalized version of it. Another example 
is the connection between original experimental data such 
as sequencing data, additional data such as genomic refer- 
ences or gene annotations used when analysing the data, 
and results of such analyses such as lists of genetic variants 
or gene expression levels. 

Two other existing systems and strategies to manage 
genomics metadata are the ISA tools (13) and the 
eXframe system (31). Moreover, efforts like the Dataverse 
Network project (http://thedata.org/) and the Dryad Digital 
Repository (http://datadryad.org/) are successfully hosting 
and providing methods to share and cite research data. 
These systems have strengths in data archiving, sharing,
integration, standardization and visualization, but lack
EGDMS' support for active data acquisition. The EGDMS
actively participates in the data collection by collecting 
file metadata and creating records with default values 
automatically, by using existing information from the oper- 
ating system and the file system, and by providing reusable 
information in the form of tags. More importantly, these 
systems do not handle the data lineage and provenance 
relationships, which are the very root of our EGDMS 
design. Instead, the ISA tools can complement the EGDMS 
as they can be used to convert data into standard formats 
at the end of the collection process. Similarly, the eXframe 
(31) system can provide a good visualization platform for 
collected data. Considering all aspects, and especially our 
requirement for a flexible system for handling provenance 
and lineage aspects of data, we have developed EGDMS as 
a new system that complements and can interact with exist- 
ing systems. 

Installation 

Two hurdles when installing and maintaining software systems are
dependencies and compatibility issues. The EGDMS addresses these
hurdles by minimizing third party dependencies and by using embedded
solutions. Specifically, we have selected third party components
that have a good development base and some degree of collaboration
between them; for example, the selected application server Glassfish
(http://glassfish.java.net/) and the DBMS JavaDB
(http://www.oracle.com/technetwork/java/javadb/) are closely linked
embedded systems. The advantage of this strategy is that the whole
system is packaged as a single unit and ready to run without
installing anything else.

Security 

Our main security precaution is the strategy to 'avoid stor- 
ing anything worth stealing'. Although access to EGDMS is 
restricted to registered users, all information gathered by 
the system could be exposed to the public. Edit privileges, 
however, require much more control to prevent sabotage 
and accidental alterations. When activated, two-factor user 
verification using a mobile phone and email can be used for 
extra security. The certificate-based authentication used by 
the egenv tool is suited for terminal operations and avoids 
the need for the user to type in login credentials all the 
time. The diagnose instruction provided with the egenv 
tool helps to trouble-shoot and rectify problems if there 
are legitimate authentication issues. 

The system was primarily designed to catalogue research 
on biobank samples. Such research typically involves several 
levels of approvals and agreements and the data generated 
are considered to be sensitive information about the parti- 
cipating individuals. Rather than risk exposing such sensi- 
tive data by hosting the data themselves, the EGDMS 
maintains and exposes the non-sensitive information 
about the data. This simple solution to a complex issue
allows biobanks to advertise that the data exist and
researchers to become aware of them.

The tagging process 

The tagging process described here provides depositors with a method
to attach additional information about deposited files and their
content in a standardized way and with minimum effort. An
alternative to this dynamic process would be for the depositor to
fill in a form with all the information upfront. Requiring
depositors to provide standard information in a restricted way will
generally result in records having more details, as is evident from
the records generated by the ISA tool kit (13), which generates
records in accordance with standards such as MIAME (15). The
downsides of requiring depositors to provide all information upfront
are a higher threshold for depositing files with the system (the
more information requested, the less likely users are to routinely
deposit their files) and less flexibility in adding new information
that was not covered by the forms during the data deposit. The
tagging process has relaxed such restrictions to encourage regular
use of the system, but this may result in depositors not bothering
to provide all the tags necessary to describe the data. Therefore,
curators should encourage good practice and make additional edits to
complete the records. As users can also create new tags, curators
should monitor the new tags and consider structuring these into new
or existing ontologies and controlled vocabularies. Importantly,
relevant existing data can easily be identified and retagged through
the EGDMS.

When tagging is used effectively, it is easy to integrate the EGDMS
with LIMS or sample management systems used in biobanks. The EGDMS
can communicate with these systems through static links to the web
interface, through the web-services API using custom-built clients,
or through wrappers for the egenv tool. Moreover, the QR codes
generated for samples and donors can link sample containers directly
to the EGDMS and display details about the content by using a
scanner.

Potential uses and future plans 

This article's use-case is the HUNT biobank and the article
describes how the EGDMS can be used to catalogue and advertise
HUNT's sensitive data. Specifically, the EGDMS not only provides a
way to easily locate data but also maintains the relationships
between files and the relationships between the data and the samples
used. The current system, with open biological ontologies and
minimum requirements for reporting experiments, is tailored to the
life science domain. However, the concepts discussed here are valid
across research domains and could be used to manage and advertise
any type of research data.

With the current setup, the ELIXIR (http://www.elixir-europe.org/)
project could benefit from the metadata management concepts
discussed here, and the tools are already available to set up a
prototype system. ELIXIR, as a pan-European research infrastructure
for biological information, will collect massive amounts of research
data, and a cataloguing system that can be used for public data and
sensitive data alike is important for easily locating and
integrating data. The provenance recording system, which focuses on
how the data were generated in an experimental context, and the
tagging system for systematic data description would provide a
standardized interface for the catalogue.

Known issues with the EGDMS design 

As explained above, the relaxed requirements on adding 
metadata when files are deposited can lead to incomplete 
entries and thereby reduce the quality of the database con- 
tent. Another potential problem is missing information 
about provenance. Provenance can be automatically cap- 
tured by configuring computational pipelines or workflows 
to record processing steps and results in EGDMS. However, 
certain types of provenance, such as connections between
data that have been processed by proprietary software or by
collaborators off-site, have to be added manually. We
nevertheless consider the ability to add provenance manually
to be one of the key features of EGDMS, as this is essential
for our main use-case: the HUNT biobank.
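
As a rough illustration of automatic provenance capture, the
sketch below shows how a pipeline wrapper might record each
processing step as parent-child relationships between files
and later trace a file back to its sources. It is a minimal
in-memory sketch in Java: the class LineageSketch, its methods
and the file names are hypothetical and are not part of the
EGDMS API.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; EGDMS keeps such relationships in its DBMS.
class LineageSketch {
    // Maps each derived file to the files it was produced from.
    private final Map<String, Set<String>> parents = new HashMap<>();

    // Called by a pipeline step after it has produced `output` from `inputs`.
    void recordStep(String output, String... inputs) {
        parents.computeIfAbsent(output, k -> new LinkedHashSet<>())
               .addAll(Arrays.asList(inputs));
    }

    // Returns every file that `file` was directly or indirectly derived from.
    Set<String> ancestors(String file) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>(
                parents.getOrDefault(file, Collections.<String>emptySet()));
        while (!todo.isEmpty()) {
            String current = todo.pop();
            // Files already visited are skipped, which also guards against cycles.
            if (seen.add(current)) {
                todo.addAll(parents.getOrDefault(current, Collections.<String>emptySet()));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        LineageSketch lineage = new LineageSketch();
        lineage.recordStep("aligned.bam", "reads.fastq");
        lineage.recordStep("variants.vcf", "aligned.bam", "reference.fa");
        System.out.println(lineage.ancestors("variants.vcf"));
    }
}
```

Running main prints the three files that variants.vcf was
derived from. Relationships that cannot be captured this way,
such as processing done off-site, would be entered through
the same recordStep call by hand.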

Currently, updates to data already deposited in the system do
not happen automatically. A user or a program needs to call
the diff and update operations for each registered location to
perform updates. In contrast, configuring a daemon process to
monitor files and call update when appropriate would handle
updates automatically. The EGDMS, however, was designed to run
with minimum privileges and without administrator access. In
fact, the security module prevents running certain operations
as a system administrator. In addition, data kept in a
location with no network connectivity to the EGDMS server
cannot be automatically checked for changes.

Table 1. Additional data used in the tagging process and their sources

Type                                                                 Source
Minimum Information guidelines from diverse bioscience communities   Downloaded from http://mibbi.sourceforge.net/portal.shtml
Open biological ontologies                                           Downloaded from http://obofoundry.org/
The Nord-Trøndelag health study (HUNT) [a]                           Created using HUNT database export with kind assistance from Jon Heggland.
Sequence Ontology                                                    Downloaded from http://www.sequenceontology.org/

[a] Not available in the public version of EGDMS.

Technology choices

There could be valid criticisms of the technologies used in
the EGDMS as well. For instance, a hierarchical DBMS rather
than a relational one may better handle the data lineage
information discussed here. Further, much simpler application
servers than Glassfish, such as Jetty
(http://www.eclipse.org/jetty/), can be used in applications
like this. The database manager software was selected
considering overall requirements and not just an individual
case. Specifically, the ability to capture all of the
relationships efficiently, documentation, developer community,
licence for redistribution, cost, programmatic accessibility,
maintainability and interaction with other third-party
components were the requirements that influenced the decision.
The Glassfish server was selected due to its integration with
JavaDB, its security, its convenient methods for programmatic
deployment of web-services, its regular upgrades and its
programmatic configurability through available documentation.
Having said that, the end-user has the option to replace the
DBMS or the application server if desired.

Ongoing improvements

We have included a tag library with the EGDMS, created using
publicly available data. This library needs to be expanded by
including details from instrument and reagent manufacturers,
more phenotypic data, disease conditions, file type
descriptions and experimental protocols. Special-purpose tools
need to be included with the system to simplify and automate
the data collection and annotation
process. For example, a client to extract tags from sample
sheets and from VCF4 files is currently being designed. The
current web-service operates on Simple Object Access Protocol
(SOAP), and the addition of a Representational State Transfer
(REST or RESTful) implementation would introduce more ways to
interact with the EGDMS. The only command line client for the
web-service is a Java-based tool, and more clients and
examples for other programming languages will follow. A beta
version of the command line client for Microsoft Windows™ is
available and will be improved to handle platform-specific
issues. The synchronization operation of the system can be
improved; for instance, local biobanks can have local servers,
which are synchronized with a central server connecting all
the biobanks. This way it is possible to have a central
information catalogue with fewer access restrictions
containing selected information from the local biobanks. The
current system provides support for MySQL as a JavaDB
replacement out of the box, and more systems, including
PostgreSQL, will be introduced in the future.
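
The proposed local-to-central synchronization could, for
instance, copy only an approved subset of each entry's
metadata to the central catalogue. The following Java sketch
illustrates that idea under stated assumptions: entries are
modelled as simple key-value maps, and the class name, method
name and field names (such as donor_id) are hypothetical
rather than taken from the EGDMS schema.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the EGDMS synchronization code.
class CentralSyncSketch {
    // Returns a copy of a local entry that keeps only the fields cleared for sharing.
    static Map<String, String> selectForCentral(Map<String, String> localEntry,
                                                Set<String> sharedFields) {
        Map<String, String> central = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : localEntry.entrySet()) {
            if (sharedFields.contains(field.getKey())) {
                central.put(field.getKey(), field.getValue());
            }
        }
        return central;
    }

    public static void main(String[] args) {
        Map<String, String> local = new LinkedHashMap<>();
        local.put("file_name", "variants.vcf");
        local.put("file_type", "VCF");
        local.put("donor_id", "D-0042"); // assumed sensitive: stays on the local server
        Set<String> shared = new HashSet<>(Arrays.asList("file_name", "file_type"));
        System.out.println(selectForCentral(local, shared));
    }
}
```

Keeping the list of shared fields explicit means that a field
is withheld from the central catalogue unless it has been
deliberately cleared for sharing.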

Materials and Methods 

The development was carried out using the Netbeans IDE
(https://netbeans.org/) on a workstation running Ubuntu 12.04
with Oracle® Java Development Kit 1.7.0. During development
and testing, the Glassfish application server
(http://glassfish.java.net/) and the MySQL DBMS
(http://www.mysql.com/) were used. JavaServer Pages (JSP)
technology, HTML and JavaScript (JS) were used to make the web
interface. The JDBC API handles the communication between the
DBMS and the interfaces. The API for XML Web Services (JAX-WS,
http://docs.oracle.com/javase/7/docs/technotes/guides/xml/jax-ws/)
with Java Architecture for XML Binding (JAXB) technology
(https://jaxb.java.net/) was used to create stub code for the
web-service client used with the precompiled version of the
egenv tool. The QR code generation was done by using the zxing
library (http://code.google.com/p/zxing/). The mail management
was implemented using JavaMail™. The user needs to provide a
Google mail (Gmail, http://mail.google.com/) account for the
mail notifications to work. Bug tracking and developer
documentation were maintained using Trac
(http://trac.edgewall.org/). The additional data used in the
tagging process and shipped with the EGDMS are listed in
Table 1.

The EGDMS source code is available from
https://github.com/Sabryr/EGDMS.

Supplementary Data 

Supplementary data are available at Database Online. 

Funding 

This work was supported by the eVITA programme of the 
Norwegian Research Council. Funding for open access 
charge: Norwegian Research Council. 

Conflict of interest. None declared. 
